1 Introduction

In 2017, Arifu designed an experiment to test whether learners preferred narrative-based or fact-based training.

A fertilizer training was designed with two variations: ‘THUMB’ for fact-based and ‘NARRATIVE’ for narrative-based.

Your task is to determine which of the two variations was more popular.

NB: Assume that the length of the training does not matter.

In addition, feel free to generate any interesting insights from the dataset. Your output will be a write-up and the code used. Both the findings and an explanation of your method should be provided.

2 Load the required libraries

library(infer)
library(readr) #to load csv data.
library(dplyr) #data manipulation
library(ggplot2)
library(plotly)
library(DataExplorer)
library(naniar)
library(powerMediation)#power analysis

#library(pwr)
library(broom)
library(DT)

seed=2010

3 Load the data

fertilizer_data<-read_csv('data/challenge 2 dataset (fertilizer).csv')

4 Initial Data Analysis (IDA)

Introduce the data

introduce(fertilizer_data)
## # A tibble: 1 x 9
##    rows columns discrete_columns continuous_colu~ all_missing_col~
##   <int>   <int>            <int>            <int>            <int>
## 1  6731       7                6                1                0
## # ... with 4 more variables: total_missing_values <int>,
## #   complete_rows <int>, total_observations <int>, memory_usage <dbl>

Plot the data introduction

plot_intro(fertilizer_data)

Look at the columns that have missing values

miss_var_summary(fertilizer_data)
## # A tibble: 7 x 3
##   variable       n_miss pct_miss
##   <chr>           <int>    <dbl>
## 1 user_response     266    3.95 
## 2 message_out       116    1.72 
## 3 message_in         20    0.297
## 4 learner_id          0    0    
## 5 program_code        0    0    
## 6 variation_code      0    0    
## 7 created_at          0    0

Plot the missing data

plot_missing(fertilizer_data)

Every column with missing data has less than 5% of its values missing. Since we have a relatively large number of observations, we can simply drop the rows that contain missing values. This removes 379 of the 6,731 rows (about 5.6%), leaving 6,352.

# drop rows with missing values
fertilizer_data<-na.omit(fertilizer_data)

Look at the internal structure

glimpse(fertilizer_data)
## Observations: 6,352
## Variables: 7
## $ learner_id     <dbl> 164274, 164274, 164274, 164274, 164274, 164274,...
## $ program_code   <chr> "YARA", "YARA", "YARA", "YARA", "YARA", "YARA",...
## $ variation_code <chr> "NARRATIVE", "NARRATIVE", "NARRATIVE", "NARRATI...
## $ message_in     <chr> "YARA", "A", "A", "1", "A", "A", "A", "A", "A",...
## $ message_out    <chr> "(1/23) A healthy crop makes a wealthy farmer. ...
## $ created_at     <chr> "11/2/2017 14:06", "11/2/2017 14:08", "11/2/201...
## $ user_response  <chr> "A", "A", "1", "A", "A", "A", "A", "A", "2", "A...

5 Plot of distribution of variation code

fertilizer_data%>%
  ggplot(aes(variation_code))+
  geom_bar()

The plot shows more rows for ‘THUMB’ (fact-based training) than for ‘NARRATIVE’ (narrative-based training), which suggests that THUMB was the more popular variation.

But is the difference statistically significant? We will carry out an A/B test to find out.
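Each row in the dataset is a message, and a single learner_id appears on many rows, so the bar chart counts messages rather than learners. The sketch below uses a small hypothetical data frame (`toy`, invented here for illustration and not part of the Arifu data) to show how row counts and distinct-learner counts can differ. It uses base R only.

```r
# Hypothetical toy data: one row per message, several rows per learner
toy <- data.frame(
  learner_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
  variation_code = c("THUMB", "THUMB", "THUMB", "NARRATIVE", "NARRATIVE",
                     "THUMB", "THUMB", "THUMB", "THUMB", "NARRATIVE")
)

# Rows (messages) per variation: this is what geom_bar() counts
rows_per_variation <- table(toy$variation_code)

# Distinct learners per variation
learners_per_variation <- tapply(
  toy$learner_id, toy$variation_code,
  function(ids) length(unique(ids))
)

rows_per_variation
learners_per_variation
```

In this toy example THUMB has more rows than NARRATIVE, yet both variations have the same number of learners. Running the same two summaries on fertilizer_data would show whether the row-level picture holds at the learner level.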

6 Research Question

Which of the two fertilizer training variations, ‘THUMB’ (fact-based) or ‘NARRATIVE’ (narrative-based), was more popular with learners?

7 Hypothesis

H0: The difference between the proportions of the THUMB and NARRATIVE variations is zero.

HA: The proportion of users who preferred THUMB-based training is greater than the proportion of users who preferred NARRATIVE-based training.
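The two hypotheses can also be written symbolically. The symbols $p_T$ and $p_N$, for the proportions of THUMB and NARRATIVE observations, are notation introduced here for illustration and do not appear in the original brief.

```latex
\begin{aligned}
H_0 &: p_T - p_N = 0 \\
H_A &: p_T - p_N > 0
\end{aligned}
```

Because every observation belongs to exactly one of the two variations, $p_T + p_N = 1$, so $H_0$ is equivalent to $p_T = 0.5$.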

8 Variable of interest

Variation code

9 Power Analysis

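Let us determine the sample size needed to detect the observed difference between the two proportions with 80% power at a 5% significance level.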

# share of rows belonging to each variation
fertilizer_data %>%
  select(variation_code) %>%
  table() %>%
  prop.table()
## .
## NARRATIVE     THUMB 
## 0.4458438 0.5541562
total_sample_size <- SSizeLogisticBin(
  p1 = 0.4458438, # observed proportion of NARRATIVE rows (from the table above)
  p2 = 0.5541562, # observed proportion of THUMB rows, about 10.8 percentage points higher
  B = 0.5,        # proportion of the sample in the second group (ideally 0.5)
  alpha = 0.05,   # significance level: probability of rejecting H0 when it is true
  power = 0.8     # 1 - beta: probability of rejecting H0 when HA is true
)

total_sample_size
## [1] 667
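As a rough cross-check, the same inputs can be passed to base R's power.prop.test(), which returns the required size per group for a two-sample comparison of proportions. This is a sketch under the assumption that two equal-sized groups are being compared. It is a different function from the one used above, so small differences in the result are possible.

```r
# Cross-check of the sample size using base R only (no extra packages).
# p1 and p2 are the proportions reported in the table above.
p1 <- 0.4458438
p2 <- 0.5541562

check <- power.prop.test(
  p1 = p1,
  p2 = p2,
  sig.level = 0.05,
  power = 0.8,
  alternative = "two.sided"
)

n_per_group <- check$n      # required observations per group
total_n <- 2 * n_per_group  # both groups combined
ceiling(total_n)
```

The total should land close to the 667 reported by SSizeLogisticBin(), since both calculations rest on a normal approximation with the same inputs.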

10 Data Sampling

Now let us draw a random sample of 667 observations from the cleaned data.

#set seed
set.seed(seed)

#generate 667 random observations 
fertilizer_sample_data <- fertilizer_data%>% 
  select(variation_code)%>%
  sample_n(667) 

#view
glimpse(fertilizer_sample_data)
## Observations: 667
## Variables: 1
## $ variation_code <chr> "NARRATIVE", "NARRATIVE", "NARRATIVE", "THUMB",...

11 Distribution

We can now observe the distributions of our two training categories.

ggplotly(
ggplot(data = fertilizer_sample_data, aes(x = variation_code)) +
  geom_bar())

In the sample, THUMB again appears more often than NARRATIVE, which suggests a higher preference for THUMB trainings.

12 Test statistic

The test statistic is the difference in proportions between THUMB and NARRATIVE.

13 Observed difference in proportions

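# NOTE: diff(n) returns the difference in counts (THUMB minus NARRATIVE,
# since groups are sorted alphabetically), not the difference in proportions.
# Divide by the sample size (667) to express it as a proportion.
observed_diff_in_prop <- fertilizer_sample_data %>%
  group_by(variation_code) %>%
  tally() %>%
  summarise(diff(n)) %>%
  pull()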

#observed_diff_in_prop

#p_hat <- fertilizer_sample_data %>% 
 # specify(response = variation_code, success = "THUMB") %>% 
 # calculate(stat = "prop")
#p_hat
diff_prop_data<-fertilizer_sample_data%>%
  summarize(diff_in_prop=observed_diff_in_prop)

diff_prop_data
## # A tibble: 1 x 1
##   diff_in_prop
##          <int>
## 1           93
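The value 93 is a difference in counts, not in proportions. The sketch below converts it, assuming that diff(n) is THUMB minus NARRATIVE (dplyr sorts the groups alphabetically) and that the sample has 667 rows as drawn earlier. The variable names are chosen here for illustration.

```r
# Counts implied by the output above:
#   n_thumb - n_narrative = 93 and n_thumb + n_narrative = 667
sample_size <- 667
count_diff <- 93

n_thumb <- (sample_size + count_diff) / 2      # 380
n_narrative <- (sample_size - count_diff) / 2  # 287

p_thumb <- n_thumb / sample_size
p_narrative <- n_narrative / sample_size

# Difference in proportions (THUMB minus NARRATIVE)
diff_in_prop <- p_thumb - p_narrative
round(diff_in_prop, 3)
```

This gives a difference of about 0.139, that is, THUMB leads NARRATIVE by roughly 14 percentage points in the sample.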

13.1 Simulated Data/ Bootstrap Distribution under Null Hypothesis

fertilizer_sample_data$constant<-1:nrow(fertilizer_sample_data)
#set.seed(seed)

#boot_dist_prop <-fertilizer_sample_data%>%
 # specify(constant~variation_code, success='THUMB')%>%
 #   hypothesize(null = 'independence') %>% 
 # generate(reps = 10)%>%
 # calculate(stat = 'prop', na.rm = TRUE, order = c('THUMB','NARRATIVE'))
##### test if sample is consistent with known population
#binom.test(x=fertilizer_sample_data$variation_code, p=0.513, alternative="greater")
#view
#glimpse(boot_dist_prop)
#unique(boot_dist_prop$stat)
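The commented-out code above does not produce a null distribution. As one possible alternative, the sketch below simulates the proportion of THUMB under the null hypothesis using base R. It is a parametric simulation, not the infer pipeline attempted above, and it assumes that H0 corresponds to a THUMB proportion of 0.5 and that the sample holds 380 THUMB rows out of 667 (the counts implied by the difference of 93).

```r
set.seed(2010)  # same seed value as defined at the top of the document

sample_size <- 667
n_thumb <- 380  # implied by the count difference of 93 reported above
p_null <- 0.5   # assumption: under H0 both variations are equally likely

# Simulate the proportion of THUMB in 10,000 samples drawn under H0
reps <- 10000
null_props <- rbinom(reps, size = sample_size, prob = p_null) / sample_size

# One-sided p-value: share of null samples at least as extreme as observed
observed_prop <- n_thumb / sample_size
p_value_sim <- mean(null_props >= observed_prop)

# Exact binomial test for comparison
exact <- binom.test(n_thumb, sample_size, p = p_null, alternative = "greater")

p_value_sim
exact$p.value
```

A small p-value from both approaches would support rejecting H0 in favour of HA. Both treat the 667 rows as independent draws, which is an approximation because one learner contributes several rows.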